Dimension reduction is a critical technique in data analysis: it simplifies complex datasets while preserving the most meaningful information. One of the most popular methods is Principal Component Analysis (PCA), which captures the maximum variance while eliminating redundancy. PCA has broad applications, from compressing financial datasets to preprocessing textual data.
However, image data poses unique challenges. Images are high-dimensional: even a modestly sized image (100x100 pixels) contains 10,000 features in grayscale, and even more in color. Processing such high-dimensional data is computationally expensive. Reducing the dimensionality of image data is therefore an essential step to improve efficiency and enable further analysis, such as clustering or applying machine learning models.
Several methods, such as t-SNE, autoencoders, and Independent Component Analysis (ICA), can be used to reduce the dimensionality of data. Among them, PCA has a specialized application called the Eigenface approach, which is particularly suited to facial image datasets. This method identifies the principal components, or “eigenfaces,” that capture the most variance among images, reducing dimensionality while preserving essential facial features.
In this project, I will use the Eigenface approach to analyze and reduce the dimensionality of a dataset containing more than 500 face images of distinct individuals across different ages. By applying this method, the project aims to demonstrate how dimensionality reduction can make image processing computationally efficient while retaining critical information.
Each image is treated as a high-dimensional vector. For a grayscale image of size m x n, the pixel values are arranged into a vector of length m x n. If the dataset contains N images, these vectors form a data matrix X of size N x (m x n).
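As a minimal sketch of this flattening step, using synthetic pixel values rather than the project's dataset (all names and sizes here are illustrative):

```r
# Illustrative sketch: flatten N images of size m x n into an N x (m*n) matrix
set.seed(1)
m <- 4; n <- 4; N <- 3
imgs <- lapply(1:N, function(i) matrix(runif(m * n), nrow = m, ncol = n))

# as.vector() unrolls each image column by column into one long pixel vector
X <- do.call(rbind, lapply(imgs, as.vector))
dim(X)  # N rows, one per image; m*n columns, one per pixel
```

Each row of X is now one observation for PCA, exactly as the face images are treated below.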
The process starts by mean-centering the images and computing a covariance matrix that describes the variability in the dataset. For efficiency, a reduced covariance matrix is computed. The eigenvectors of this matrix, known as eigenfaces, represent the primary directions of variation, while the eigenvalues quantify the amount of variance each eigenface explains. To reduce dimensionality, the top k eigenfaces are selected to form a lower-dimensional subspace that retains most of the variance.
Images are then projected into this reduced eigenface space, compressing their size while preserving key features.
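A minimal sketch of this pipeline on synthetic data may help make the steps concrete. The variable names and sizes are illustrative, not the project's dataset; the "reduced covariance" trick of eigen-decomposing the N x N matrix instead of the (m*n) x (m*n) matrix is the part that makes the Eigenface approach efficient:

```r
# Eigenface pipeline sketch on synthetic data (illustrative sizes)
set.seed(42)
N <- 20; p <- 100                          # 20 "images", 100 "pixels" each
X <- matrix(runif(N * p), nrow = N)

mean_face <- colMeans(X)                   # the average image
X_centered <- sweep(X, 2, mean_face)       # mean-center every pixel

# Reduced covariance trick: decompose the small N x N matrix, not p x p
L <- tcrossprod(X_centered) / (N - 1)      # X_centered %*% t(X_centered)
eig <- eigen(L)                            # eigenvalues come out in decreasing order

# Map the small eigenvectors back to pixel space and normalize -> eigenfaces
k <- 5
eigenfaces <- t(X_centered) %*% eig$vectors[, 1:k]
eigenfaces <- apply(eigenfaces, 2, function(v) v / sqrt(sum(v^2)))

# Project the images onto the top k eigenfaces: the compressed representation
scores <- X_centered %*% eigenfaces
dim(scores)                                # N x k instead of N x p
```

The `scores` matrix is the low-dimensional encoding; each image is now described by k coefficients instead of p pixels.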
The dataset contains a total of 552 facial images, ranging from young to elderly individuals, on which the Eigenfaces algorithm will be implemented.
library(imager)
library(data.table)
image_files <- list.files("faces", pattern = "\\.jpg$", full.names = TRUE)
images <- lapply(image_files, load.image)
length(images)
## [1] 552
range(sapply(images, range))
## [1] 0 1
par(mfrow = c(2, 3))
lapply(images[1:6], function(img) {plot(as.cimg(img))})
## [[1]]
## Image. Width: 309 pix Height: 394 pix Depth: 1 Colour channels: 3
##
## [[2]]
## Image. Width: 309 pix Height: 395 pix Depth: 1 Colour channels: 3
##
## [[3]]
## Image. Width: 309 pix Height: 388 pix Depth: 1 Colour channels: 3
##
## [[4]]
## Image. Width: 309 pix Height: 383 pix Depth: 1 Colour channels: 3
##
## [[5]]
## Image. Width: 309 pix Height: 396 pix Depth: 1 Colour channels: 3
##
## [[6]]
## Image. Width: 304 pix Height: 360 pix Depth: 1 Colour channels: 3
Since PCA operates on numerical data, and grayscale reduces computational complexity compared to RGB images (3 channels), the image data will first be converted to grayscale and then resized to 64x64.
processed_img <- lapply(images, function(img){
grayscale_image <- grayscale(img)
resized_image <- resize(grayscale_image, 64, 64)
return(resized_image)})
dim(processed_img[[1]])
## [1] 64 64 1 1
plot(processed_img[[1]])
image_ranges <- sapply(processed_img, range)
All of the images have already been normalized to the [0, 1] range by the imager library, so I can proceed with the next step.
The percentage of extreme pixels (values of exactly 0 or 1) in each image will be checked to detect noisy images.
extreme_pixel <- sapply(processed_img, function(img) {sum(img == 0 | img == 1) / length(img) * 100})
summary(extreme_pixel)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4395 2.6733 4.0504 6.6895 25.3662
hist(extreme_pixel, breaks = 30, main = "Extreme pixel percentage",
xlab = "Percentage of extreme pixels", col = "lightblue")
The majority of images are clean, with minimal noise (close to 0%); only a small subset of images may contain noise and needs a closer look.
noisy_indices <- which(extreme_pixel > 10)
noisy_images <- processed_img[noisy_indices]
length(noisy_images)
## [1] 54
par(mfrow = c(2, 3))
lapply(noisy_images[1:6], plot)
## [[1]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[2]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[3]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[4]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[5]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[6]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
After settling on a threshold for extreme pixels, I will remove the images that contain more than 10% extreme pixels. These images all have strong edges or tilted faces, which may not align well with the average face representation and could cause the Eigenfaces algorithm to treat them as outliers.
processed_img <- processed_img[-noisy_indices]
Each image in the dataset is a bitmap, and to proceed with PCA I need to flatten the images from 2D to 1D, because every instance must be represented as a vector of attributes.
vector_img <- do.call(rbind, lapply(processed_img, as.vector))
dim(vector_img)
## [1] 498 4096
vector_img <- as.matrix(vector_img)
# Plot two images from the matrix for checking
image1 <- matrix(vector_img[1, ], nrow = 64, ncol = 64)
image2 <- matrix(vector_img[2, ], nrow = 64, ncol = 64)
par(mfrow = c(1, 2))
plot(as.cimg(image1), main = "Image 1")
plot(as.cimg(image2), main = "Image 2")
At this point, I have an organized data matrix with 498 rows, each corresponding to an image in the dataset, and 4096 columns representing the total number of pixels in each image. I will use this matrix to apply principal component analysis.
The principal components will be computed from the eigenvectors and eigenvalues of the covariance matrix of these vectors. The eigenvectors correspond to the eigenfaces: linear combinations of the original pixel values that capture the most significant patterns in the images.
pca <- prcomp(vector_img, center = TRUE, scale. = TRUE)
summary(pca)
The standard deviation represents the amount of variance captured by each principal component (PC). Higher values indicate that the corresponding PC captures more variability in the data.
The proportion of variance shows the fraction of total variance explained by each principal component, and the cumulative variance (sum of variance up to a specific PC) tells how much total variance is captured by including multiple PCs.
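These quantities can be derived directly from the standard deviations that `prcomp` returns. A small sketch on synthetic data (illustrative only, not the face matrix):

```r
# Sketch: variance explained, computed from prcomp's sdev on synthetic data
set.seed(7)
X <- matrix(rnorm(50 * 10), nrow = 50)
pc <- prcomp(X, center = TRUE, scale. = TRUE)

prop_var <- pc$sdev^2 / sum(pc$sdev^2)   # proportion of variance per PC
cum_var  <- cumsum(prop_var)             # cumulative variance up to each PC

round(prop_var, 3)
round(cum_var, 3)
```

By construction the proportions sum to 1, and the cumulative variance reaches 1 at the last component; this is the same calculation applied later to choose how many eigenfaces to keep.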
To better understand what eigenfaces look like, I will plot the first few of them.
eigenfaces <- lapply(1:6, function(i) {matrix(pca$rotation[, i], nrow = 64, ncol = 64)})
par(mfrow = c(2, 3))
lapply(1:6, function(i) {plot(as.cimg(eigenfaces[[i]]), main = paste("Eigenface", i))})
## [[1]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[2]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[3]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[4]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[5]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
##
## [[6]]
## Image. Width: 64 pix Height: 64 pix Depth: 1 Colour channels: 1
To get a broader view when choosing how many principal components to keep, the cumulative variance will be computed, which helps determine the minimum number of principal components required to retain most of the variability.
cumulative_var <- cumsum(pca$sdev^2 / sum(pca$sdev^2))
cumulative_var[1:10]
## [1] 0.1960918 0.3290569 0.4057565 0.4536820 0.4995195 0.5331842 0.5643720
## [8] 0.5919035 0.6124758 0.6308212
A scree plot helps visualize how variance is distributed among the principal components and can be used to identify the point where adding more PCs yields diminishing returns.
plot(cumulative_var, type = "b", main = "Cumulative variance",
xlab = "Number of principal components",
ylab = "Cumulative variance explained", col = "grey80")
abline(h = 0.90, col = "darkred", lty = 2)
The cumulative variance increases quickly over the first few principal components and then levels off. The curve flattens significantly by around 50-100 components, indicating that additional components add very little variance. The red dashed line marks the 90% variance threshold, showing where most of the dataset’s variability is captured.
eigenvalues <- pca$sdev^2
library(factoextra)
fviz_eig(pca, choice = "eigenvalue", ncp = 22, barfill = "firebrick", barcolor = "firebrick", linecolor = "black", addlabels = TRUE)
In the scree plot, the eigenvalues start to decrease more gradually after the 5th component, suggesting that components beyond this point contribute less variance. Under the Kaiser criterion, the eigenvalues drop below 1 at around component 22, so up to 22 components may be retained.
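As a small sketch of the Kaiser criterion on synthetic standardized data (sizes are illustrative): with `scale. = TRUE`, each original variable contributes exactly one unit of variance, so an eigenvalue of 1 is the natural cutoff.

```r
# Kaiser criterion sketch on synthetic standardized data
set.seed(11)
X <- matrix(rnorm(100 * 20), nrow = 100)
pc <- prcomp(X, center = TRUE, scale. = TRUE)

eigenvalues <- pc$sdev^2          # eigenvalues of the correlation matrix
kaiser_k <- sum(eigenvalues > 1)  # retain PCs explaining more than one variable's worth
kaiser_k
```

A useful check: with standardized variables the eigenvalues sum to the number of variables, so "eigenvalue > 1" means "above-average variance".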
Different approaches yield different results; for this dataset, however, I would like to retain as much information as possible, so the threshold is set at 90% of the variance.
num_components <- which(cumsum(pca$sdev^2 / sum(pca$sdev^2)) >= 0.90)[1]
reduced <- pca$x[, 1:num_components]
The PCA Eigenface method allows images to be reconstructed by combining the reduced components (eigenface coefficients) with the top eigenfaces, demonstrating how well the reduced representation captures the original images.
reconstructed <- reduced %*% t(pca$rotation[, 1:num_components])
reconstructed <- sweep(reconstructed, 2, pca$scale, "*")   # undo the scaling applied by prcomp
reconstructed <- sweep(reconstructed, 2, pca$center, "+")  # add the mean image back, row-wise
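As a sanity check of this reconstruction logic, a small sketch on synthetic data (illustrative sizes only) shows that keeping all components reconstructs the data exactly, so truncating to k components is the only source of reconstruction error:

```r
# Sanity check: full-rank PCA reconstruction is exact (synthetic data)
set.seed(3)
X <- matrix(runif(30 * 8), nrow = 30)
pc <- prcomp(X, center = TRUE, scale. = TRUE)

full <- pc$x %*% t(pc$rotation)        # back to the scaled, centered space
full <- sweep(full, 2, pc$scale, "*")  # undo scaling
full <- sweep(full, 2, pc$center, "+") # undo centering

max(abs(full - X))                     # numerically zero
```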
Compare the original images to their reduced versions using the first 3 images.
par(mfrow = c(1, 2))
for (i in 1:3) {
original_image <- matrix(vector_img[i, ], nrow = 64, ncol = 64)
plot(as.cimg(original_image), main = paste("Original image", i))
reconstructed_image <- matrix(reconstructed[i, ], nrow = 64, ncol = 64)
plot(as.cimg(reconstructed_image), main = paste("Reconstructed Image", i))}
The original images show more detail; the reconstructed ones appear smoother, losing some detail, but the faces are still recognizable.
The mean squared error (MSE) between the original and reconstructed datasets is calculated to assess how well the reduced dimensions can reconstruct the original dataset.
mse <- mean((vector_img - reconstructed)^2)
cat("Reconstruction MSE:", mse, "\n")
## Reconstruction MSE: 0.5124069
To evaluate the reduced-dimension images further, instead of processing the entire dataset, I will select a subset of images for evaluation.
set.seed(42)
subset_indices <- sample(1:nrow(vector_img), size = 100)
original_subset <- vector_img[subset_indices, ]
reconstructed_subset <- reconstructed[subset_indices, ]
#calculate MSE for each image
mse <- rowMeans((original_subset - reconstructed_subset)^2)
avr_mse <- mean(mse)
cat("Average MSE for the subset:", avr_mse, "\n")
## Average MSE for the subset: 0.5069979
The MSE values from the whole dataset and from the subset are relatively close, suggesting consistent reconstruction performance across both. The subset's mean squared error is slightly lower than the whole dataset's, which is expected, as the subset may consist of images that are more representative or easier to reconstruct than the dataset as a whole.
#MSE distribution
hist(mse, breaks = 20, col = "lightblue",
main = "Distribution of reconstruction MSE",
xlab = "MSE", ylab = "Frequency")
In the next step, the best and worst reconstructed images will be examined.
best <- subset_indices[order(mse)[1:3]]
worst <- subset_indices[order(mse, decreasing = TRUE)[1:3]]
best_or <- vector_img[best, ]
best_rd <- reconstructed[best, ]
worst_or <- vector_img[worst, ]
worst_rd <- reconstructed[worst, ]
visualize <- function(original, reconstructed, title) {
par(mfrow = c(1, 2))
for (i in 1:nrow(original)) {
original_img <- as.cimg(matrix(original[i, ], nrow = 64, ncol = 64))
plot(original_img, main = paste(title, "Original", i))
reconstructed_img <- as.cimg(matrix(reconstructed[i, ], nrow = 64, ncol = 64))
plot(reconstructed_img, main = paste(title, "Reconstructed", i))}}
visualize(best_or, best_rd, "Best")
visualize(worst_or, worst_rd, "Worst")
The worst reconstructed images may not align well with the primary eigenfaces, meaning the PCA model failed to capture their unique variations. Notably, these worst images likely suffer from lighting conditions: the contrast is too high or too low, or the faces have shadows or overly bright areas.
Using a k-Nearest Neighbor classifier, with the processed images split into train and test sets at a 70:30 ratio, I will evaluate the classifier's accuracy and runtime on the Eigenface-based, reduced-dimension dataset and then compare it with the full-dimensional dataset.
Since this dataset does not contain labels, I will generate labels by applying k-means clustering with k = 5, and use those labels for the train and test datasets.
#generate labels with k = 5
k <- 5
labels <- kmeans(vector_img, centers = k)$cluster
#split data
set.seed(123)
training <- sample(1:nrow(vector_img), size = 0.7 * nrow(vector_img))
testing <- setdiff(1:nrow(vector_img), training)
#label for train and test data
train_labels <- as.factor(labels[training])
test_labels <- as.factor(labels[testing])
trainset <- vector_img[training, ]
testset <- vector_img[testing, ]
The PCA algorithm will be applied to the training dataset. The transformed data is then projected onto a 95-component space (which keeps 90% of the variance in the data). After that, the train and test data will be used for the classification task.
#apply pca
pcatrain <- prcomp(trainset, center = TRUE, scale. = TRUE)
proj_train <- as.data.frame(pcatrain$x)
proj_test <- as.data.frame(predict(pcatrain, newdata = testset))
#reduce the components of train and test data to 95 components
proj_train <- proj_train[, 1:num_components, drop = FALSE]
proj_test <- proj_test[, 1:num_components, drop = FALSE]
Calculate the computational time needed to run classification on the 95-component images.
library(class)
start_pca <- Sys.time()
predictions_pca <- knn(train = proj_train, test = proj_test, cl = train_labels, k = 3)
accuracy_pca <- sum(predictions_pca == test_labels) / length(test_labels)
end_pca <- Sys.time()
time_pca <- end_pca - start_pca
cat("PCA Accuracy:", accuracy_pca)
## PCA Accuracy: 0.74
cat("Runtime of PCA-reduced dataset:", time_pca, "\n\n")
## Runtime of PCA-reduced dataset: 0.006309032
Runtime and accuracy for classification with the complete image dimensions:
start_org <- Sys.time()
predictions_org <- knn(train = trainset, test = testset, cl = train_labels, k = 3)
accuracy_org <- sum(predictions_org == test_labels) / length(test_labels)
end_org <- Sys.time()
time <- end_org - start_org
cat("Accuracy of classification on full-dimensional dataset:", accuracy_org)
## Accuracy of classification on full-dimensional dataset: 0.7133333
cat("Runtime of full-dimensional dataset:", time, "\n\n")
## Runtime of full-dimensional dataset: 0.3703301
As a result, classification on the full-dimensional dataset took significantly longer than on the PCA-reduced dataset, which is expected since more dimensions require more computation. It is worth noting that in this case PCA improves efficiency without compromising accuracy; a likely reason is that feature reduction through PCA eliminates less important dimensions, leaving a cleaner and more discriminative dataset for classification.
The power of eigenfaces for data compression is remarkable: they retain the critical information needed for effective face representation. In this project, starting from high-dimensional data (4096 dimensions for 64x64 pixels), only 95 eigenvectors (95 dimensions) are enough to reconstruct images with notable similarity to the originals, and the reduced data can be used for further machine learning applications.
This approach not only facilitates efficient data storage but also improves the performance of machine learning models. Since reduced dimensionality eliminates noise and redundancy, classifiers can generalize more effectively when operating in a lower-dimensional space. Overall, this project demonstrates the effectiveness of eigenfaces for compact data representation in facial recognition tasks.